Abstract. We present an algorithm for computing the Lyndon factorization of a
string that is given in grammar compressed form, namely, a Straight Line
Program (SLP). The algorithm runs in $O(n^4 + mn^3h)$ time and $O(n^2)$ space,
where $m$ is the size of the Lyndon factorization, $n$ is the size of the SLP,
and $h$ is the height of the derivation tree of the SLP. Since the length of
the decompressed string can be exponentially large w.r.t. $n$, $m$ and $h$, our
result is the first polynomial time solution when the string is given as SLP.



1 Introduction 

Compressed string processing (CSP) is the task of processing compressed string
data without explicit decompression. As any method that first decompresses the
data requires time and space dependent on the decompressed size of the data,
CSP without explicit decompression has been gaining importance due to the
ever increasing amount of data produced and stored. A number of efficient CSP
algorithms have been proposed, e.g., see [16,25,15,12,11,13]. In this paper, we
present new CSP algorithms that compute the Lyndon factorization of strings.

A string $\ell$ is said to be a Lyndon word if $\ell$ is lexicographically
smallest among the circular permutations of its characters. For example, aab is
a Lyndon word, but its circular permutations aba and baa are not. Lyndon words
have various and important applications in, e.g., musicology [4],
bioinformatics [8], approximation algorithm [22], string matching [6,2,23],
word combinatorics [10,24], and free Lie algebras [20].

The Lyndon factorization (a.k.a. standard factorization) of a string $w$,
denoted $LF(w)$, is a unique sequence of Lyndon words such that the
concatenation of the Lyndon words gives $w$ and the Lyndon words in the sequence
are lexicographically non-increasing [5]. Lyndon factorizations are used in a
bijective variant of Burrows-Wheeler transform [17,14] and a digital geometry
algorithm [3]. Duval [5] proposed an elegant on-line algorithm to compute
$LF(w)$ of a given string $w$ of length $N$ in $O(N)$ time. Efficient parallel
algorithms to compute the Lyndon factorization are also known [1,7].

We present a new CSP algorithm which computes the Lyndon factorization
$LF(w)$ of a string $w$, when $w$ is given in a grammar-compressed form. Let $m$
be the number of factors in $LF(w)$. Our first algorithm computes $LF(w)$ in
$O(n^4 + mn^3h)$ time and $O(n^2)$ space, where $n$ is the size of a given
straight-line program (SLP), which is a context-free grammar in Chomsky normal
form that derives only $w$, and $h$ is the height of the derivation tree of the
SLP. Since the decompressed string length $|w| = N$ can be exponentially large
w.r.t. $n$, $m$ and $h$, our $O(n^4 + mn^3h)$ solution can be efficient for
highly compressive strings.

2 Preliminaries 

2.1 Strings and model of computation 

Let $\Sigma$ be a finite alphabet. An element of $\Sigma^*$ is called a string.
The length of a string $w$ is denoted by $|w|$. The empty string $\varepsilon$
is a string of length 0, namely, $|\varepsilon| = 0$. Let $\Sigma^+$ be the set
of non-empty strings, i.e., $\Sigma^+ = \Sigma^* - \{\varepsilon\}$. For a
string $w = xyz$, $x$, $y$ and $z$ are called a prefix, substring, and suffix of
$w$, respectively. A prefix $x$ of $w$ is called a proper prefix of $w$ if
$x \neq w$, i.e., $x$ is shorter than $w$. The set of suffixes of $w$ is denoted
by $Suffix(w)$. The $i$-th character of a string $w$ is denoted by $w[i]$, where
$1 \le i \le |w|$. For a string $w$ and two integers $1 \le i \le j \le |w|$,
let $w[i..j]$ denote the substring of $w$ that begins at position $i$ and ends
at position $j$. For convenience, let $w[i..j] = \varepsilon$ when $i > j$. For
any string $w$ let $w^1 = w$, and for any integer $k \ge 2$ let
$w^k = w w^{k-1}$, i.e., $w^k$ is a $k$-time repetition of $w$.

A positive integer $p$ is said to be a period of a string $w$ if
$w[i] = w[i + p]$ for all $1 \le i \le |w| - p$. Let $w$ be any string and $q$
be its smallest period. If $p$ is a period of a string $w$ such that $p < |w|$,
then the positive integer $|w| - p$ is said to be a border of $w$. If $w$ has no
borders, then $w$ is said to be border-free.
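
The definitions of period and border can be checked directly on small strings.
The following is a minimal brute-force sketch in Python, meant only as an
illustration; the function names periods, borders and is_border_free are our
own and do not appear in the paper, and positions are 0-based in the code.

```python
def periods(w):
    """Return all periods p (1 <= p <= |w|) of w, in increasing order.

    p is a period if w[i] == w[i + p] for every valid position i
    (0-based here, 1-based in the text).
    """
    n = len(w)
    return [p for p in range(1, n + 1)
            if all(w[i] == w[i + p] for i in range(n - p))]


def borders(w):
    """Return the border lengths |w| - p for every period p < |w|."""
    n = len(w)
    return [n - p for p in periods(w) if p < n]


def is_border_free(w):
    """A string is border-free if it has no borders."""
    return not borders(w)
```

For example, abaab has periods 3 and 5 and a single border of length 2 (the
string ab), while aab is border-free.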

If character $a \in \Sigma$ is lexicographically smaller than another character
$b \in \Sigma$, then we write $a \prec b$. For any non-empty strings
$x, y \in \Sigma^+$, let $lcp(x, y)$ be the length of the longest common prefix
of $x$ and $y$. We denote $x \prec y$, if either of the following conditions
holds: $x[lcp(x,y) + 1] \prec y[lcp(x,y) + 1]$, or $x$ is a proper prefix of
$y$. For a set $S \subseteq \Sigma^+$ of non-empty strings, let $\min_\prec S$
denote the lexicographically smallest string in $S$.
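
As an illustration, the order defined above can be written out explicitly. The
sketch below is a direct transcription under the assumption that characters are
compared by their code points; the names lcp, precedes and min_lex are ours.
On such alphabets the relation coincides with Python's built-in string
comparison.

```python
def lcp(x, y):
    """Length of the longest common prefix of x and y."""
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return n


def precedes(x, y):
    """True if x is lexicographically smaller than y.

    Either the first mismatching character of x is smaller,
    or x is a proper prefix of y.
    """
    h = lcp(x, y)
    if h < len(x) and h < len(y):
        return x[h] < y[h]
    return h == len(x) and len(x) < len(y)


def min_lex(strings):
    """Lexicographically smallest string of a non-empty collection."""
    best = None
    for s in strings:
        if best is None or precedes(s, best):
            best = s
    return best
```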

Our model of computation is the word RAM: We shall assume that the
computer word size is at least $\lceil \log_2 |w| \rceil$, and hence, standard
operations on values representing lengths and positions of string $w$ can be
performed in constant time. Space complexities will be determined by the number
of computer words (not bits).



2.2 Lyndon words and Lyndon factorization of strings 

Two strings $x$ and $y$ are said to be conjugate, if there exist strings $u$ and
$v$ such that $x = uv$ and $y = vu$. A string $w$ is said to be a Lyndon word,
if $w$ is lexicographically strictly smaller than all of its conjugates.
Namely, $w$ is a Lyndon word, if for any factorization $w = uv$ with non-empty
$u$ and $v$, it holds that $uv \prec vu$. It is known that any Lyndon word is
border-free.
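
The definition translates into a simple quadratic-time test that compares a
string with each of its conjugates. This is only a sketch for experimenting
with small examples, not an efficient method; the names conjugates and
is_lyndon are ours.

```python
def conjugates(w):
    """All conjugates vu of w = uv with u and v both non-empty."""
    return [w[i:] + w[:i] for i in range(1, len(w))]


def is_lyndon(w):
    """True if w is non-empty and strictly smaller than all its conjugates."""
    return len(w) > 0 and all(w < c for c in conjugates(w))
```

With this test, aab is accepted while aba, baa and abab are rejected; abab
fails because one of its conjugates equals abab itself.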
Fig. 1. The derivation tree of SLP $S = \{X_1 \to a, X_2 \to b,
X_3 \to X_1X_2, X_4 \to X_1X_3, X_5 \to X_3X_4, X_6 \to X_4X_5,
X_7 \to X_6X_5\}$, representing string $val(X_7)$ = aababaababaab.



Definition 1 ([5]). The Lyndon factorization of a string $w$, denoted $LF(w)$,
is the factorization $\ell_1^{p_1} \cdots \ell_m^{p_m}$ of $w$, such that each
$\ell_i \in \Sigma^+$ is a Lyndon word, $p_i \ge 1$, and
$\ell_i \succ \ell_{i+1}$ for all $1 \le i < m$.



It is known that the Lyndon factorization is unique for each string $w$, and it
was shown by Duval [5] that the Lyndon factorization can be computed in $O(N)$
time, where $N = |w|$.

$LF(w)$ can be represented by the sequence
$(|\ell_1|, p_1), \ldots, (|\ell_m|, p_m)$ of integer pairs, where each pair
$(|\ell_i|, p_i)$ represents the $i$-th Lyndon factor $\ell_i^{p_i}$ of $w$.
Note that this representation requires $O(m)$ space.
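
For uncompressed strings, this pair representation can be produced by the
well-known linear-time formulation of Duval's algorithm. The sketch below is
the standard textbook version, written here for illustration; the function
name and the choice to group each run of equal factors into one (length, power)
pair are ours.

```python
def lyndon_factorization(w):
    """Return LF(w) as a list of pairs (|l_i|, p_i), left to right."""
    n = len(w)
    i = 0
    pairs = []
    while i < n:
        # Extend the longest prefix of w[i:] of the form l^p l',
        # where l is a Lyndon word and l' is a proper prefix of l.
        j, k = i + 1, i
        while j < n and w[k] <= w[j]:
            k = i if w[k] < w[j] else k + 1
            j += 1
        length = j - k          # |l|
        power = 0
        while i <= k:           # emit every full copy of l
            i += length
            power += 1
        pairs.append((length, power))
    return pairs
```

For instance, banana factorizes as b, an, an, a, i.e. the pairs (1,1), (2,2),
(1,1), and aababaababaab factorizes as (aabab)^2 (aab), i.e. the pairs (5,2),
(3,1).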



2.3 Straight line programs



A straight line program (SLP) is a set of productions
$S = \{X_1 \to expr_1, X_2 \to expr_2, \ldots, X_n \to expr_n\}$, where each
$X_i$ is a variable and each $expr_i$ is an expression, where $expr_i = a$
($a \in \Sigma$), or $expr_i = X_{\ell(i)} X_{r(i)}$ ($i > \ell(i), r(i)$). It
is essentially a context free grammar in Chomsky normal form, that derives a
single string. Let $val(X_i)$ represent the string derived from variable $X_i$.
To ease notation, we sometimes associate $val(X_i)$ with $X_i$ and denote
$|val(X_i)|$ as $|X_i|$, and $val(X_i)[u..v]$ as $X_i[u..v]$ for
$1 \le u \le v \le |X_i|$. An SLP $S$ represents the string $w = val(X_n)$. The
size of the program $S$ is the number $n$ of productions in $S$. Let $N$ be the
length of the string represented by SLP $S$, i.e., $N = |w|$. Then $N$ can be as
large as $2^{n-1}$.

The derivation tree of SLP $S$ is a labeled ordered binary tree where each
internal node is labeled with a non-terminal variable in
$\{X_1, \ldots, X_n\}$, and each leaf is labeled with a terminal character in
$\Sigma$. The root node has label $X_n$. An example of the derivation tree of an
SLP is shown in Fig. 1.
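
To make the definition concrete, an SLP can be stored as a table of
productions and expanded bottom-up. The sketch below uses our reading of the
seven productions of Fig. 1 (an assumption, checked only against the string
aababaababaab given in the caption); the dictionary encoding and the name
expand are illustrative choices. Explicit expansion takes time proportional to
the decompressed length, which is exactly what compressed string processing
aims to avoid, so it is useful only for small examples.

```python
# Production i maps to a single character, or to a pair (left, right)
# of indices of earlier variables.
FIG1_SLP = {
    1: "a",
    2: "b",
    3: (1, 2),
    4: (1, 3),
    5: (3, 4),
    6: (4, 5),
    7: (6, 5),
}


def expand(slp):
    """Return a dict mapping each variable index i to the string val(X_i)."""
    vals = {}
    for i in sorted(slp):       # right-hand sides only use smaller indices
        rhs = slp[i]
        if isinstance(rhs, str):
            vals[i] = rhs
        else:
            left, right = rhs
            vals[i] = vals[left] + vals[right]
    return vals
```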



3 Computing Lyndon factorization from SLP 

In this section, we show how, given an SLP $S$ of $n$ productions representing
string $w$, we can compute $LF(w)$ of size $m$ in $O(n^4 + mn^3h)$ time. We will
make use of the following known results:

Lemma 1 ([9]). For any string $w$, let
$LF(w) = \ell_1^{p_1}, \ldots, \ell_m^{p_m}$. Then,
$\ell_m = \min_\prec Suffix(w)$, i.e., $\ell_m$ is the lexicographically
smallest suffix of $w$.

Lemma 2 ([18]). Given an SLP $S$ of size $n$ representing a string $w$ of length
$N$, and two integers $1 \le i \le j \le N$, we can compute in $O(n)$ time
another SLP of size $O(n)$ representing the substring $w[i..j]$.

Lemma 3 ([18]). Given an SLP $S$ of size $n$ representing a string $w$ of length
$N$, we can compute the shortest period of $w$ in $O(n^2 \log N)$ time and
$O(n^2)$ space.

For any non-empty string $w \in \Sigma^+$, let
$LFCand(w) = \{x \mid x \in Suffix(w), \exists y \in \Sigma^+
\text{ s.t. } xy = \min_\prec Suffix(wy)\}$. Intuitively, $LFCand(w)$ is the set
of suffixes of $w$ which are a prefix of the lexicographically smallest suffix
of string $wy$, for some non-empty string $y \in \Sigma^+$.

The following lemma may be almost trivial, but will play a central role in
our algorithm.

Lemma 4. For any two strings $u, v \in LFCand(w)$ with $|u| < |v|$, $u$ is a
prefix of $v$.

Proof. If $v[1..|u|] \prec u$, then for any non-empty string $y$,
$vy \prec uy$. However, this contradicts that $u \in LFCand(w)$. If
$v[1..|u|] \succ u$, then for any non-empty string $y$, $vy \succ uy$. However,
this contradicts that $v \in LFCand(w)$. Hence we have $v[1..|u|] = u$. □

Lemma 5. For any string $w$, let $\ell = \min_\prec Suffix(w)$. Then, the
shortest string of $LFCand(w)$ is $\ell^p$, where $p \ge 1$ is the maximum
integer such that $\ell^p$ is a suffix of $w$.

Proof. For any string $x \in LFCand(w)$, and any non-empty string $y$,
$xy = \min_\prec Suffix(wy)$ holds only if $y \succ \ell$.

Firstly, we compare $\ell^p$ with the suffixes $s$ of $w$ shorter than
$\ell^p$, and show that $\ell^p y \prec sy$ holds for any $y \succ \ell$. Such
suffixes $s$ are divided into two groups: (1) If $s$ is of form $\ell^k$ for any
integer $1 \le k < p$, then $\ell^p y \prec \ell^k y = sy \prec y$ holds for any
$y \succ \ell$. (2) If $s$ is not of form $\ell^k$, then since $\ell$ is
border-free, $\ell$ is not a prefix of $s$, and $s$ is not a prefix of $\ell$,
either. Thus $\ell^p \prec s$ holds, implying that $\ell^p y \prec sy$ for any
$y \succ \ell$.

Secondly, we compare $\ell^p$ with the suffixes $t$ of $w$ longer than
$\ell^p$, and show that $\ell^p y \prec ty$ holds for some $y \succ \ell$. By
Lemma 4, $t = \ell^q u$ holds, where $q \ge p$ is the maximum integer such that
$\ell^q$ is a prefix of $t$, and $u \in \Sigma^+$. By definition,
$\ell \prec u$ and $\ell$ is not a prefix of $u$. Choosing
$y = \ell^{q-p} u'$ with $u' \prec u$, we have
$\ell^p y = \ell^q u' \prec \ell^q u = t \prec ty$. Hence,
$\ell^p \in LFCand(w)$ and no shorter strings exist in $LFCand(w)$. □



By Lemma 1 and Lemma 5, computing the last Lyndon factor $\ell_m^{p_m}$ of
$w = val(X_n)$ reduces to computing $LFCand(X_n)$ for the last variable $X_n$.
In what follows, we propose a dynamic programming algorithm to compute
$LFCand(X_i)$ for each variable. Firstly we show the number of strings in
$LFCand(X_i)$ is $O(\log N)$, where $N = |val(X_n)| = |w|$.

Lemma 6. For any string $w$, let $s_j$ be the $j$th shortest string of
$LFCand(w)$. Then, $|s_{j+1}| > 2|s_j|$ for any $1 \le j < |LFCand(w)|$.

Proof. Let $\ell = \min_\prec Suffix(w)$, and $y$ any string such that
$y \succ \ell$. It follows from Lemma 5 that $\ell$ is a prefix of any string
$s_j \in LFCand(w)$, and hence $s_j \prec y$ holds.

Assume on the contrary that $|s_{j+1}| \le 2|s_j|$. If $|s_{j+1}| = 2|s_j|$,
i.e., $s_{j+1} = s_j s_j$, then $s_{j+1} y = s_j s_j y \prec s_j y$ holds, but
this contradicts that $s_j \in LFCand(w)$. Hence $s_{j+1} \neq s_j s_j$. If
$|s_{j+1}| < 2|s_j|$, by Lemma 4 $s_j$ is a prefix of $s_{j+1}$, and therefore
$s_j$ has a period $q$ such that $s_{j+1} = u^k v$ and $s_j = u^{k-1} v$, where
$u = s_j[1..q]$, $k > 1$ is an integer, and $v$ is a proper prefix of $u$. There
are two cases to consider: (1) If $uvy \prec vy$, then
$u^k vy \prec u^{k-1} vy = s_j y$. (2) If $vy \prec uvy$, then
$vy \prec uvy \prec u^2 vy \prec \cdots \prec u^{k-1} vy = s_j y$. It means that
$\min_\prec \{u^k vy, vy\} \prec s_j y$ for any $y \succ \ell$, however, this
contradicts that $s_j \in LFCand(w)$. Hence $|s_{j+1}| > 2|s_j|$ holds. □

Since $s_j$ is a suffix of $s_{j+1}$, it follows from Lemma 4 and Lemma 6 that
$s_{j+1} = s_j t s_j$ with some non-empty string $t \in \Sigma^+$. This also
implies that the number of strings in $LFCand(w)$ is $O(\log N)$, where $N$ is
the length of $w$. By identifying each suffix of $LFCand(X_i)$ with its length,
and using Lemma 6, $LFCand(X_i)$ for all variables can be stored in a total of
$O(n \log N)$ space.

For any two variables $X_i$, $X_j$ of an SLP $S$ and a positive integer $k$
satisfying $|X_i| \ge k + |X_j| - 1$, consider the FM function such that
$FM(X_i, X_j, k) = lcp(val(X_i)[k..|X_i|], val(X_j))$, i.e., it returns the
length of the lcp of the suffix of $val(X_i)$ starting at position $k$ and
$X_j$.

Lemma 7 ([21,19]). We can preprocess a given SLP $S$ of size $n$ in $O(n^3)$
time and $O(n^2)$ space so that $FM(X_i, X_j, k)$ can be answered in $O(n^2)$
time.

For each variable $X_i$ we store the length $|X_i|$ of the string derived by
$X_i$. It requires a total of $O(n)$ space for all $1 \le i \le n$, and can be
computed in a total of $O(n)$ time by a simple dynamic programming algorithm.
Given a position $j$ of the uncompressed string $w$ of length $N$, i.e.,
$1 \le j \le N$, we can retrieve the $j$th character $w[j]$ in $O(n)$ time by a
simple binary search on the derivation tree of $X_n$ using the lengths stored in
the variables. Hence, we can lexicographically compare $val(X_i)[k..|X_i|]$ and
$val(X_j)$ in $O(n^2)$ time, after $O(n^3)$-time preprocessing.
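
The length table and the character access described above can be sketched as
follows. This is an illustrative Python version under our own encoding of an
SLP as a dictionary from variable index to either a character or a pair of
smaller indices; the names lengths and char_at are not from the paper. Neither
function expands the string, so both also work when $N$ is exponential in $n$.

```python
def lengths(slp):
    """Compute |X_i| for every variable by dynamic programming."""
    size = {}
    for i in sorted(slp):       # right-hand sides only use smaller indices
        rhs = slp[i]
        if isinstance(rhs, str):
            size[i] = 1
        else:
            size[i] = size[rhs[0]] + size[rhs[1]]
    return size


def char_at(slp, size, i, j):
    """Return val(X_i)[j] for a 1-based position j, without decompression."""
    if not 1 <= j <= size[i]:
        raise IndexError("position out of range")
    # Walk down the derivation tree, choosing the child that contains j.
    while not isinstance(slp[i], str):
        left, right = slp[i]
        if j <= size[left]:
            i = left
        else:
            j -= size[left]
            i = right
    return slp[i]
```

As a usage example, the SLP with $X_1 \to a$ and $X_k \to X_{k-1} X_{k-1}$ for
$2 \le k \le n$ derives a string of length $2^{n-1}$, and char_at answers a
query on it by following at most $n$ productions.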

The following lemma shows a dynamic programming approach to compute
$LFCand(X_i)$ for each variable $X_i$. We will mean by a sorted list of
$LFCand(X_i)$ the list of the elements of $LFCand(X_i)$ sorted in increasing
order of length.

Fig. 2. Lemma 8: Initially $D_i = LFCand(X_r)$ and
$h = lcp(s \cdot val(X_r), d)$ with $s$ being the shortest string of
$LFCand(X_\ell)$.



Lemma 8. Let $X_i \to X_\ell X_r$ be any production of a given SLP $S$ of size
$n$. Provided that sorted lists for $LFCand(X_\ell)$ and $LFCand(X_r)$ are
already computed, a sorted list for $LFCand(X_i)$ can be computed in $O(n^3)$
time and $O(n^2)$ space.

Proof. Let $D_i$ be a sorted list of the suffixes of $X_i$ that are candidates
of elements of $LFCand(X_i)$. We initially set $D_i \leftarrow LFCand(X_r)$.

We process the elements of $LFCand(X_\ell)$ in increasing order of length. Let
$s$ be any string in $LFCand(X_\ell)$, and $d$ the longest string in $D_i$.
Since any string of $LFCand(X_r)$ is a prefix of $d$ by Lemma 4, in order to
compute $LFCand(X_i)$ it suffices to lexicographically compare
$s \cdot val(X_r)$ and $d$. Let $h = lcp(s \cdot val(X_r), d)$. See also
Fig. 2.

- If $(s \cdot val(X_r))[h + 1] \prec d[h + 1]$, then
  $s \cdot val(X_r) \prec d$. Since any string in $D_i$ is a prefix of $d$ by
  Lemma 4, we observe that any element in $D_i$ that is longer than $h$ cannot
  be an element of $LFCand(X_i)$. Hence we delete any element of $D_i$ that is
  longer than $h$ from $D_i$, then add $s \cdot val(X_r)$ to $D_i$, and update
  $d \leftarrow s \cdot val(X_r)$. See also Fig. 3.

- If $(s \cdot val(X_r))[h + 1] \succ d[h + 1]$, then
  $s \cdot val(X_r) \succ d$. Since $s \cdot val(X_r)$ cannot be an element of
  $LFCand(X_i)$, in this case neither $D_i$ nor $d$ is updated. See also Fig. 4.

- If $h = |d|$, i.e., $d$ is a prefix of $s \cdot val(X_r)$, then there are two
  sub-cases:

  - If $|s \cdot val(X_r)| \le 2|d|$, $d$ has a period $q$ such that
    $s \cdot val(X_r) = u^k v$ and $d = u^{k-1} v$, where $u = d[1..q]$,
    $k > 1$ is an integer, and $v$ is a proper prefix of $u$. By similar
    arguments to Lemma 6, we observe that $d$ cannot be a member of
    $LFCand(X_i)$ while $s \cdot val(X_r)$ may be a member of $LFCand(X_i)$.
    Thus we add $s \cdot val(X_r)$ to $D_i$, delete $d$ from $D_i$, and update
    $d \leftarrow s \cdot val(X_r)$. See also Fig. 5.

  - If $|s \cdot val(X_r)| > 2|d|$, then both $d$ and $s \cdot val(X_r)$ may be
    a member of $LFCand(X_i)$. Thus we add $s \cdot val(X_r)$ to $D_i$, and
    update $d \leftarrow s \cdot val(X_r)$. See also Fig. 6.




Fig. 3. Lemma 8: Case where
$(s \cdot val(X_r))[h + 1] = \alpha \prec d[h + 1] = \beta$. The elements in
$D_i$ that are longer than $h$ are deleted from $D_i$. Then $s \cdot val(X_r)$
becomes the longest candidate in $D_i$.

Fig. 4. Lemma 8: Case where
$(s \cdot val(X_r))[h + 1] = \alpha \succ d[h + 1] = \beta$. There are no
updates on $D_i$.

Fig. 5. Lemma 8: Case where $h = |d|$ and $|s \cdot val(X_r)| \le 2|d|$. Since
$s \cdot val(X_r) = u^k v$ and $d = u^{k-1} v$, $d$ is deleted from $D_i$ and
$s \cdot val(X_r)$ is added to $D_i$.

Fig. 6. Lemma 8: Case where $h = |d|$ and $|s \cdot val(X_r)| > 2|d|$. We add
$s \cdot val(X_r)$ to $D_i$, and $s \cdot val(X_r)$ becomes the longest member
of $D_i$.

Fig. 7. Lemma 8:
$lcp(z, d) = \min\{lcp(Z, X_{n+i}[|X_i| - |d| + 1..|X_{n+i}|]), |d|\}$.



We represent the strings in $LFCand(X_\ell)$, $LFCand(X_r)$, $LFCand(X_i)$, and
$D_i$ by their lengths. Given sorted lists of $LFCand(X_\ell)$ and
$LFCand(X_r)$, the above algorithm computes a sorted list for $D_i$, and it
follows from Lemma 6 that the number of elements in $D_i$ is always
$O(\log N)$. Thus all the above operations on $D_i$ can be conducted in
$O(\log N)$ time in each step.

We now show how to efficiently compute $h = lcp(s \cdot val(X_r), d)$, for any
$s \in LFCand(X_\ell)$. Let $z$ be the longest string in $LFCand(X_\ell)$, and
consider processing any string $s \in LFCand(X_\ell)$. Since $s$ is a prefix of
$z$ by Lemma 4, we can compute $lcp(s \cdot val(X_r), d)$ as follows:

$$lcp(s \cdot val(X_r), d) =
\begin{cases}
lcp(z, d) & \text{if } lcp(z, d) < |s|, \\
|s| + lcp(X_r, d[|s| + 1..|d|]) & \text{if } lcp(z, d) \ge |s|.
\end{cases}$$
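
The two-case formula can be sanity-checked on explicit strings. The sketch
below is a plain-string illustration only: it uses a naive lcp instead of the
FM function on SLPs, and the names lcp and lcp_concat, as well as passing $s$
by its length s_len, are our own choices. It relies only on $s$ being a prefix
of $z$.

```python
def lcp(x, y):
    """Length of the longest common prefix of x and y."""
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return n


def lcp_concat(z, s_len, xr, d):
    """Return lcp(s + xr, d), where s = z[:s_len] is a prefix of z."""
    h = lcp(z, d)
    if h < s_len:
        # The comparison with d already ends inside s.
        return h
    # s is a prefix of d, so continue comparing xr with the rest of d.
    return s_len + lcp(xr, d[s_len:])
```

The point of the formula is that $lcp(z, d)$ is computed once for the longest
candidate $z$ and then reused for every shorter candidate $s$.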



To compute the above lcp values using the FM function, for each variable $X_i$
of $S$ we create a new production $X_{n+i} \to X_i X_i$, and hence the number of
variables increases to $2n$. In addition, we construct a new SLP of size $O(n)$
that derives $z$ in $O(n)$ time using Lemma 2. Let $Z$ be the variable such that
$val(Z) = z$. It holds that

$$lcp(z, d) = \min\{lcp(Z, X_{n+i}[|X_i| - |d| + 1..|X_{n+i}|]), |d|\}$$

and

$$lcp(X_r, d[|s| + 1..|d|]) =
\min\{lcp(X_r, X_{n+r}[|X_r| - |d| + |s| + 1..|X_{n+r}|]), |d| - |s|\}.$$

See also Fig. 7 and Fig. 8.

By using Lemma 7 we preprocess, in $O(n^3)$ time and $O(n^2)$ space, the
SLP consisting of these variables so that the query $FM(X_i, X_j, k)$ for
answering $lcp(X_i[k..|X_i|], X_j)$ is supported in $O(n^2)$ time. Therefore
$lcp(s \cdot val(X_r), d)$ can be computed in $O(n^2)$ time for each
$s \in LFCand(X_\ell)$. Since there exist $O(\log N)$ elements in
$LFCand(X_\ell)$, we can compute $LFCand(X_i)$ in
$O(n^3 + n^2 \log N) = O(n^3)$ time. The total space complexity is $O(n^2)$. □

Fig. 8. Lemma 8: $lcp(X_r, d[|s| + 1..|d|]) =
\min\{lcp(X_r, X_{n+r}[|X_r| - |d| + |s| + 1..|X_{n+r}|]), |d| - |s|\}$.

Since there are $n$ productions in a given SLP, using Lemma 8 we can compute
$LFCand(X_n)$ for the last variable $X_n$ in a total of $O(n^4)$ time. The main
result of this paper follows.

Theorem 1. Given an SLP $S$ of size $n$ representing a string $w$, we can
compute $LF(w)$ in $O(n^4 + mn^3h)$ time and $O(n^2)$ space, where $m$ is the
number of factors in $LF(w)$ and $h$ is the height of the derivation tree of
$S$.

Proof. Let $LF(w) = \ell_1^{p_1} \cdots \ell_m^{p_m}$. First, using Lemma 8 we
compute LFCand for all variables in $S$ in $O(n^4)$ time. Next we will compute
the Lyndon factors from right to left. Suppose that we have already computed
$\ell_{j+1}^{p_{j+1}} \cdots \ell_m^{p_m}$ and we are computing the $j$th Lyndon
factor $\ell_j^{p_j}$. Using Lemma 2, we construct in $O(n)$ time a new SLP of
size $O(n)$ describing $w[1..|w| - \sum_{k=j+1}^{m} p_k|\ell_k|]$, which is the
prefix of $w$ obtained by removing the suffix
$\ell_{j+1}^{p_{j+1}} \cdots \ell_m^{p_m}$ from $w$. Here we note that the new
SLP actually has $O(h)$ new variables since
$w[1..|w| - \sum_{k=j+1}^{m} p_k|\ell_k|]$ can be represented by a sequence of
$O(h)$ variables in $S$. Let $Y$ be the last variable of the new SLP. Since
LFCand for all variables in $S$ have already been computed, it is enough to
compute LFCand for $O(h)$ new variables. Hence using Lemma 8 we compute a sorted
list of $LFCand(Y) = LFCand(w[1..|w| - \sum_{k=j+1}^{m} p_k|\ell_k|])$ in a
total of $O(n^3h)$ time. It follows from Lemma 5 that the shortest element of
$LFCand(Y)$ is $\ell_j^{p_j}$, the $j$th Lyndon factor of $w$. Note that each
string in $LFCand(Y)$ is represented by its length, and so far we only know the
total length $p_j|\ell_j|$ of the $j$th Lyndon factor. Since $\ell_j$ is border
free, $|\ell_j|$ is the shortest period of $\ell_j^{p_j}$. We construct a new
SLP of size $O(n)$ describing $\ell_j^{p_j}$, and compute $|\ell_j|$ in
$O(n^2 \log N)$ time using Lemma 3. We repeat the above procedure $m$ times, and
hence $LF(w)$ can be computed in a total of
$O(n^4 + m(n^3h + n^2 \log N)) = O(n^4 + mn^3h)$ time. To compute each Lyndon
factor of $LF(w)$, we need $O(n^2)$ space for Lemma 3 and Lemma 8. Since
$LFCand(X_i)$ for each variable $X_i$ requires $O(\log N)$ space, the total
space complexity is $O(n^2 + n \log N) = O(n^2)$. □
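
The right-to-left structure of the proof can be mimicked on an uncompressed
string. The following is a naive sketch, not the SLP algorithm: it takes the
smallest suffix of the current prefix (Lemma 1), strips its maximal power from
the end (Lemma 5), and then recovers the factor length as the shortest period
of the stripped block, as in the proof. All names are ours, and the running
time is polynomial in $N$ rather than in $n$.

```python
def smallest_period(x):
    """Smallest p >= 1 such that x[i] == x[i + p] for all valid i."""
    n = len(x)
    for p in range(1, n + 1):
        if all(x[i] == x[i + p] for i in range(n - p)):
            return p
    return 0  # only reached for the empty string


def lyndon_factorization_right_to_left(w):
    """Return LF(w) as pairs (|l_j|, p_j), computed from the last factor."""
    factors = []
    end = len(w)
    while end > 0:
        # Lemma 1: the last Lyndon factor is the smallest suffix.
        ell = min(w[i:end] for i in range(end))
        # Lemma 5: take the maximal power of ell that is a suffix.
        start = end
        while start >= len(ell) and w[start - len(ell):start] == ell:
            start -= len(ell)
        block = w[start:end]
        # As in the proof: |l_j| is the shortest period of l_j^{p_j}.
        length = smallest_period(block)
        factors.append((length, len(block) // length))
        end = start
    factors.reverse()
    return factors
```

On banana this yields the pairs (1,1), (2,2), (1,1), and on aababaababaab it
yields (5,2), (3,1).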

4 Conclusions and open problem 

Lyndon words and Lyndon factorization are important concepts of combinatorics
on words, with various applications. Given a string in terms of an SLP of size
$n$, we showed how to compute the Lyndon factorization of the string in
$O(n^4 + mn^3h)$ time using $O(n^2)$ space, where $m$ is the size of the Lyndon
factorization and $h$ is the height of the SLP. Since the decompressed string
length $N$ can be exponential w.r.t. $n$, $m$ and $h$, our algorithm can be
useful for highly compressive strings.

An interesting open problem is to compute the Lyndon factorization from
a given LZ78 encoding [25]. Each LZ78 factor is a concatenation of the longest
previous factor and a single character. Hence, it can be seen as a special class
of SLPs, and this property would lead us to a much simpler and/or more efficient
solution to the problem. Noting the number $s$ of the LZ78 factors is
$\Omega(\sqrt{N})$, a question is whether we can solve this problem in
$o(s^2) + O(m)$ time.